-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM64-SVE: Add MultiplyAddRotateComplexBySelectedScalar
#105002
Conversation
Note regarding the
|
Note regarding the
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
@@ -190,6 +190,7 @@ HARDWARE_INTRINSIC(Sve, MinNumberAcross, | |||
HARDWARE_INTRINSIC(Sve, Multiply, -1, 2, true, {INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_fmul, INS_sve_fmul}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_OptionalEmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation) | |||
HARDWARE_INTRINSIC(Sve, MultiplyAdd, -1, -1, false, {INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation|HW_Flag_FmaIntrinsic|HW_Flag_SpecialCodeGen) | |||
HARDWARE_INTRINSIC(Sve, MultiplyAddRotateComplex, -1, -1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fcmla, INS_sve_fcmla}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_HasImmediateOperand) | |||
HARDWARE_INTRINSIC(Sve, MultiplyAddRotateComplexBySelectedScalar, -1, 5, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fcmla, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_HasImmediateOperand|HW_Flag_LowVectorOperation|HW_Flag_HasRMWSemantics|HW_Flag_SpecialCodeGen|HW_Flag_SpecialImport) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have HW_Category_SIMDByIndexedElement
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that category makes sense, though this API seems different enough from other HW_Category_SIMDByIndexedElement
APIs that we would have to add several special cases on that category's path (for example, all the 5-operand handling, plus the fact that we index the third vector register in pairs of elements, so the default bounds calculation for the index doesn't work). Do we want to move this logic over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
synced offline...ok to skip this but to add a comment in lsraarm64.cpp
explaining why we do not mark it such and why we just need 1 internal register although we have 2 immediates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please monitor the TP cost and also check the memory cost by adding op5?
Sure: Looks like the TP cost is trivial. Is there a tool we use to measure diffs in memory usage? |
ignore that. we have updated the struct which gets allocated on the stack and not from arena allocator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
/ba-g blocking failures are timeouts |
Part of #99957. I had to add a new test template to cover APIs with three vector args, and two immediates; the new template is identical to
_SveImmTernOpTestTemplate.template
, except that I added additionalImm
andInvalidImm
arguments. JIT changes are a bit more involved for this API because it is the first ARM64 intrinsic with five operands. This means theHWIntrinsic
struct now has an additionalGenTree*
member. If the TP impact of this is too high, I can get rid of the new member; the code for handling the fifth operand will look a bit less elegant, but it's only for this API, for now.With @kunalspathak's fix in #104998 merged in locally, all stress tests pass. The SVE encoding unit tests passed, too (I had to tweak the emitter and tests to take the rotation value in its encoded form to match the API-level behavior). Since the jump table generation is a bit odd for this API, here's what the codegen looks like:
@dotnet/arm64-contrib PTAL, thanks!